Abstract

We consider the problem of Bayesian parameter estimation for deep neural networks,
which is important in problem settings where we may have little data, and/or where we
need accurate posterior predictive densities $p(y|x, \mathcal{D})$, e.g., for applications
involving bandits or active learning. One simple approach to this is to use online Monte
Carlo methods, such as SGLD (stochastic gradient Langevin dynamics). Unfortunately, such
a method needs to store many copies of the parameters (which wastes memory), and needs
to make predictions using many versions of the model (which wastes time).

We describe a method for “distilling” a Monte Carlo approximation to the posterior
predictive density into a more compact form, namely a single deep neural network. We
compare to two very recent approaches to Bayesian neural networks, namely an approach
based on expectation propagation [HLA15] and an approach based on variational Bayes
[BCKW15]. Our method performs better than both of these, is much simpler to implement,
and uses less computation at test time.

1 Introduction 

Deep neural networks (DNNs) have recently been achieving state-of-the-art results in many fields.
However, their predictions are often overconfident, which is a problem in applications such as
active learning, reinforcement learning (including bandits), and classifier fusion, which all rely on
good estimates of uncertainty.

A principled way to tackle this problem is to use Bayesian inference. Specifically, we first
compute the posterior distribution over the model parameters,
$p(\theta|\mathcal{D}_N) \propto p(\theta) \prod_{i=1}^N p(y_i|x_i, \theta)$, where
$\mathcal{D}_N = \{(x_i, y_i)\}_{i=1}^N$, $x_i \in \mathcal{X}^D$ is the i'th input (where $D$ is
the number of features), and $y_i \in \mathcal{Y}$ is the i'th output. Then we compute the
posterior predictive distribution,
$p(y|x, \mathcal{D}_N) = \int p(y|x, \theta)\, p(\theta|\mathcal{D}_N)\, d\theta$, for each test
point $x$.

For reasons of computational speed, it is common to approximate the posterior distribution by a
point estimate such as the MAP estimate, $\hat{\theta} = \arg\max_\theta p(\theta|\mathcal{D}_N)$.
When $N$ is large, we often use stochastic gradient descent (SGD) to compute $\hat{\theta}$.
Finally, we make a plug-in approximation to the predictive distribution:
$p(y|x, \mathcal{D}_N) \approx p(y|x, \hat{\theta})$. Unfortunately, this loses most of the benefits
of the Bayesian approach, since uncertainty in the parameters (which induces uncertainty in the
predictions) is ignored.

Various ways of more accurately approximating $p(\theta|\mathcal{D}_N)$ (and hence
$p(y|x, \mathcal{D}_N)$) have been developed. Recently, [HLA15] proposed a method called
“probabilistic backpropagation” (PBP) based on an online version of expectation propagation (EP)
(i.e., using repeated assumed density filtering (ADF)), where the posterior is approximated as a
product of univariate Gaussians, one per parameter:
$p(\theta|\mathcal{D}_N) \approx q(\theta) \triangleq \prod_i \mathcal{N}(\theta_i|m_i, v_i)$.

An alternative to EP is variational Bayes (VB), where we optimize a lower bound on the marginal
likelihood. [Gra11] presented a (biased) Monte Carlo estimate of this lower bound and applied
the method, called “variational inference” (VI), to infer the neural network weights. More recently,
[BCKW15] proposed an approach called “Bayes by Backprop” (BBB), which extends the VI method
with an unbiased MC estimate of the lower bound based on the “reparameterization trick” of
[KW14, RMW14]. In both [Gra11] and [BCKW15], the posterior is approximated by a product of
univariate Gaussians.

Although EP and VB scale well with data size (since they use online learning), there are several
problems with these methods: (1) they can give poor approximations when the posterior
$p(\theta|\mathcal{D}_N)$ does not factorize, or if it has multi-modality or skew; (2) at test time,
computing the predictive density $p(y|x, \mathcal{D}_N)$ can be much slower than using the plug-in
approximation, because of the need to integrate out the parameters; (3) they need to use double
the memory of a standard plug-in method (to store the mean and variance of each parameter), which
can be problematic in memory-limited settings such as mobile phones; (4) they can be quite
complicated to derive and implement.

A common alternative to EP and VB is to use MCMC methods to approximate
$p(\theta|\mathcal{D}_N)$. Traditional MCMC methods are batch algorithms that scale poorly with
dataset size. However, recently a method called stochastic gradient Langevin dynamics (SGLD)
[WT11] has been devised that can draw samples approximately from the posterior in an online
fashion, just as SGD updates a point estimate of the parameters online. Furthermore, various
extensions of SGLD have been proposed, including stochastic gradient hybrid Monte Carlo (SGHMC)
[CFG14], stochastic gradient Nosé-Hoover Thermostat (SG-NHT) [DFB+14] (which improves upon
SGHMC), stochastic gradient Fisher scoring (SGFS) [AKW12] (which uses second order
information), stochastic gradient Riemannian Langevin Dynamics [PT13], distributed SGLD
[ASW14], etc. However, in this paper, we will just use “vanilla” SGLD [WT11].¹

All these MCMC methods (whether batch or online) produce a Monte Carlo approximation to the
posterior, $q(\theta) = \frac{1}{S} \sum_{s=1}^S \delta(\theta - \theta^s)$, where $S$ is the
number of samples. Such an approximation can be more accurate than that produced by EP or VB,
and the method is much easier to implement (for SGLD, you essentially just add Gaussian noise to
your SGD updates). However, at test time, things are $S$ times slower than using a plug-in
estimate, since we need to compute $q(y|x) = \frac{1}{S} \sum_{s=1}^S p(y|x, \theta^s)$, and the
memory requirements are $S$ times bigger, since we need to store the $\theta^s$. (For our largest
experiment, our DNN has 500k parameters, so we can only afford to store a single sample.)
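
To make the test-time cost concrete, the following minimal sketch averages the predictions of S stored parameter samples. The linear softmax model, the random stand-in "samples", and the helper names `predict_one` and `mc_predictive` are illustrative assumptions, not part of the method described here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict_one(theta, x):
    """p(y | x, theta) for a hypothetical linear softmax model; theta has shape (D, K)."""
    return softmax(x @ theta)

def mc_predictive(samples, x):
    """q(y | x) = (1/S) * sum_s p(y | x, theta^s); needs S forward passes and S stored copies."""
    return np.mean([predict_one(theta, x) for theta in samples], axis=0)

rng = np.random.default_rng(0)
S, D, K = 50, 2, 3
samples = rng.normal(size=(S, D, K))            # stand-in for posterior samples theta^1..theta^S
x = rng.normal(size=(4, D))                     # four test inputs
q = mc_predictive(samples, x)                   # Monte Carlo predictive, shape (4, K)
plug_in = predict_one(samples.mean(axis=0), x)  # one forward pass with a single point estimate
```

Both `q` and `plug_in` are valid probability vectors, but only `q` reflects the spread of the samples; the price is a loop over all S models.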

In this paper, we propose to train a parametric model $\mathcal{S}(y|x, w)$ to approximate the
Monte Carlo posterior predictive distribution $q(y|x)$ in order to gain the benefits of the
Bayesian approach while only using the same run time cost as the plug-in method. Following
[HVD14], we call $q(y|x)$ the “teacher” and $\mathcal{S}(y|x, w)$ the “student”. We use SGLD² to
estimate $q(\theta)$ and hence $q(y|x)$ online; we simultaneously train the student online to
minimize $\mathrm{KL}(q(y|x)\,\|\,\mathcal{S}(y|x, w))$. We give the details in Section 2.

Similar ideas have been proposed in the past. In particular, [SG05] also trained a parametric student
model to approximate a Monte Carlo teacher. However, they used batch training and they used
mixture models for the student. By contrast, we use online training (and can thus handle larger
datasets), and use deep neural networks for the student.

[HVD14] also trained a student neural network to emulate the predictions of a (larger) teacher
network (a process they call “distillation”), extending earlier work of [BCNM06] which approximated
an ensemble of classifiers by a single one. The key difference from our work is that our teacher
is generated using MCMC, and our goal is not just to improve classification accuracy, but also to
get reliable probabilistic predictions, especially away from the training data. [HVD14] coined the
term “dark knowledge” to represent the information which is “hidden” inside the teacher network,
and which can then be distilled into the student. We therefore call our approach “Bayesian dark
knowledge”.


1 We did some preliminary experiments with SG-NHT for fitting an MLP to MNIST data, but the results 
were not much better than SGLD. 

2 Note that SGLD is an approximate sampling algorithm and introduces a slight bias in the predictions of 
the teacher and student network. If required, we can replace SGLD with an exact MCMC method (e.g. HMC) 
to get more accurate results at the expense of more training time.

In summary, our contributions are as follows. First, we show how to combine online MCMC
methods with model distillation in order to get a simple, scalable approach to Bayesian inference
of the parameters of neural networks (and other kinds of models). Second, we show that our
probabilistic predictions lead to improved log likelihood scores on the test set compared to SGD
and the recently proposed EP and VB approaches.


2 Methods 


Our goal is to train a student neural network (SNN) to approximate the Bayesian predictive
distribution of the teacher, which is a Monte Carlo ensemble of teacher neural networks (TNN).

If we denote the predictions of the teacher by $p(y|x, \mathcal{D}_N)$ and the parameters of the
student network by $w$, our objective becomes

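$$
\begin{aligned}
L(w|x) &= \mathrm{KL}\big(p(y|x, \mathcal{D}_N)\,\|\,\mathcal{S}(y|x, w)\big)
        = -\mathbb{E}_{p(y|x, \mathcal{D}_N)} \log \mathcal{S}(y|x, w) + \mathrm{const} \\
       &= -\int \left[ \int p(y|x, \theta)\, p(\theta|\mathcal{D}_N)\, d\theta \right] \log \mathcal{S}(y|x, w)\, dy \\
       &= -\int p(\theta|\mathcal{D}_N) \int p(y|x, \theta) \log \mathcal{S}(y|x, w)\, dy\, d\theta \\
       &= -\int p(\theta|\mathcal{D}_N) \left[ \mathbb{E}_{p(y|x, \theta)} \log \mathcal{S}(y|x, w) \right] d\theta
\end{aligned}
\tag{1}
$$

(The additive constant does not depend on $w$ and is dropped after the first line.)

Unfortunately, computing this integral is not analytically tractable. However, we can approximate
this by Monte Carlo:

$$
\hat{L}(w|x) = -\frac{1}{|\Theta|} \sum_{\theta^s \in \Theta} \mathbb{E}_{p(y|x, \theta^s)} \log \mathcal{S}(y|x, w)
\tag{2}
$$

where $\Theta$ is a set of samples from $p(\theta|\mathcal{D}_N)$.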
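
For discrete outputs the expectation in Eq. (2) is a finite sum, so the estimate can be written in a few lines. The sketch below is illustrative only: the three teacher probability vectors are made-up numbers and `distill_loss` is a hypothetical helper name. It also checks a consequence of the loss being linear in the teacher probabilities: Eq. (2) equals the cross entropy between the averaged teacher q(y|x) and the student.

```python
import numpy as np

def distill_loss(teacher_probs, student_log_probs):
    """Monte Carlo distillation loss of Eq. (2) for a single input x and discrete y.

    teacher_probs:      shape (S, K), row s holds p(y | x, theta^s)
    student_log_probs:  shape (K,),  log S(y | x, w)
    """
    per_sample = -(teacher_probs * student_log_probs).sum(axis=1)  # one cross entropy per sample
    return per_sample.mean()                                       # average over the sample set

# Made-up teacher predictions for one input (S = 3 samples, K = 2 classes).
teacher_probs = np.array([[0.9, 0.1],
                          [0.6, 0.4],
                          [0.3, 0.7]])
q = teacher_probs.mean(axis=0)  # averaged teacher q(y | x)

loss_at_q = distill_loss(teacher_probs, np.log(q))
loss_elsewhere = distill_loss(teacher_probs, np.log(np.array([0.9, 0.1])))
```

The loss is smallest when the student reproduces the averaged teacher `q`, at which point it equals the entropy of `q`.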


To make this a function just of $w$, we need to integrate out $x$. For this, we need a dataset to
train the student network on, which we will denote by $\mathcal{D}'$. Note that points in this
dataset do not need ground truth labels; instead the labels (which will be probability
distributions) will be provided by the teacher. The choice of student data controls the domain
over which the student will make accurate predictions. For low dimensional problems (such as the
toy problems in Section 3), we can uniformly sample the input domain. For higher dimensional
problems, we can sample “near” the training data, for example by perturbing the inputs slightly.
In any case, we will compute a Monte Carlo approximation to the loss as follows:

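$$
\begin{aligned}
\hat{L}(w) = \int p(x)\, L(w|x)\, dx
  &\approx \frac{1}{|\mathcal{D}'|} \sum_{x' \in \mathcal{D}'} L(w|x') \\
  &\approx -\frac{1}{|\Theta|} \frac{1}{|\mathcal{D}'|} \sum_{\theta^s \in \Theta} \sum_{x' \in \mathcal{D}'}
     \mathbb{E}_{p(y|x', \theta^s)} \log \mathcal{S}(y|x', w)
\end{aligned}
\tag{3}
$$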


It can take a lot of memory to pre-compute and store the set of parameter samples $\Theta$ and the
set of data samples $\mathcal{D}'$, so in practice we use the stochastic algorithm shown in
Algorithm 1, which uses a single posterior sample $\theta^s$ and a minibatch of $x'$ at each step.

The hyper-parameters $\lambda$ and $\gamma$ from Algorithm 1 control the strength of the priors for
the teacher and student networks. We use simple spherical Gaussian priors (equivalent to $L_2$
regularization); we set the precision (strength) of these Gaussian priors by cross-validation.
Typically $\lambda \gg \gamma$, since the student gets to “see” more data than the teacher. This is
true for two reasons: first, the teacher is trained to predict a single label per input, whereas
the student is trained to predict a distribution, which contains more information (as argued in
[HVD14]); second, the teacher makes multiple passes over the same training data, whereas the
student sees “fresh” randomly generated data $\mathcal{D}'$ at each step.
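
The link between the Gaussian priors and $L_2$ regularization can be spelled out in one line. The short derivation below assumes zero-mean spherical priors; the zero mean is our assumption, since the text only says "spherical Gaussian".

```latex
\begin{aligned}
p(\theta \mid \lambda) = \mathcal{N}(\theta \mid 0, \lambda^{-1} I)
  \quad &\Longrightarrow \quad
  \log p(\theta \mid \lambda) = -\tfrac{\lambda}{2}\,\|\theta\|_2^2 + \mathrm{const},
  \qquad \nabla_\theta \log p(\theta \mid \lambda) = -\lambda\,\theta, \\
p(w \mid \gamma) = \mathcal{N}(w \mid 0, \gamma^{-1} I)
  \quad &\Longrightarrow \quad
  -\nabla_w \log p(w \mid \gamma) = \gamma\, w .
\end{aligned}
```

Under this assumption the prior contributes a weight-decay term to each update, with $\lambda$ and $\gamma$ playing the role of the decay coefficients for the teacher and the student respectively.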


Algorithm 1: Distilled SGLD

Input: $\mathcal{D}_N = \{(x_i, y_i)\}_{i=1}^N$, minibatch size $M$, number of iterations $T$,
       teacher learning schedule $\eta_t$, student learning schedule $\rho_t$,
       teacher prior $\lambda$, student prior $\gamma$
for t = 1 : T do
    // Train teacher (SGLD step)
    Sample minibatch indices $S \subset [1, N]$ of size $M$
    Sample $z_t \sim \mathcal{N}(0, \eta_t I)$
    Update $\theta_{t+1} := \theta_t + \frac{\eta_t}{2} \left( \nabla_\theta \log p(\theta_t|\lambda) + \frac{N}{M} \sum_{i \in S} \nabla_\theta \log p(y_i|x_i, \theta_t) \right) + z_t$
    // Train student (SGD step)
    Sample $\mathcal{D}'$ of size $M$ from student data generator
    Update $w_{t+1} := w_t - \rho_t \left( \frac{1}{M} \sum_{x' \in \mathcal{D}'} \nabla_w \hat{L}(w_t, \theta_{t+1}|x') + \gamma w_t \right)$


2.1 Classification

For classification problems, each teacher network $\theta^s$ models the observations using a
standard softmax model, $p(y = k|x, \theta^s)$. We want to approximate this using a student
network, which also has a softmax output, $\mathcal{S}(y = k|x, w)$. Hence from Eqn. 2, our loss
function estimate is the standard cross entropy loss:

$$
\hat{L}(w|\theta^s, x) = -\sum_{k=1}^K p(y = k|x, \theta^s) \log \mathcal{S}(y = k|x, w)
\tag{4}
$$

The student network outputs $\beta_k(x, w) = \log \mathcal{S}(y = k|x, w)$. To estimate the
gradient w.r.t. $w$, we just have to compute the gradients w.r.t. $\beta$ and back-propagate
through the network. These gradients are given by
$\frac{\partial \hat{L}(w, \theta^s|x)}{\partial \beta_k(x, w)} = -p(y = k|x, \theta^s)$.
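
The interleaved teacher and student updates can be sketched end to end on a toy problem. This is a minimal illustration, not the implementation used in the experiments: both networks are replaced by linear softmax models, the step sizes are constants picked for the toy data, and the student data generator (training inputs plus Gaussian noise of scale 0.1) is an assumed choice. The gradient is taken with respect to the pre-softmax logits, where chaining the gradient above through the log-softmax normalization gives S(y|x,w) - p(y|x,θ).

```python
import numpy as np

def softmax(a):
    """Row-wise, numerically stable softmax."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def distilled_sgld(X, y, K, T=2000, M=10, eta=1e-2, rho=0.1, lam=1.0, gamma=1e-3, seed=0):
    """Toy distilled SGLD with linear softmax teacher and student (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = np.zeros((D, K))  # teacher parameters
    w = np.zeros((D, K))      # student parameters
    Y = np.eye(K)[y]          # one-hot training labels
    for _ in range(T):
        # Teacher: SGLD step on a minibatch of the training data.
        idx = rng.choice(N, size=M, replace=False)
        grad_lik = X[idx].T @ (Y[idx] - softmax(X[idx] @ theta))  # sum_i grad log p(y_i | x_i, theta)
        grad_log_post = -lam * theta + (N / M) * grad_lik         # Gaussian prior + rescaled likelihood
        z = rng.normal(scale=np.sqrt(eta), size=theta.shape)      # z_t ~ N(0, eta * I)
        theta = theta + 0.5 * eta * grad_log_post + z
        # Student: SGD step on fresh inputs labelled by the current teacher sample.
        Xs = X[rng.choice(N, size=M)] + rng.normal(scale=0.1, size=(M, D))
        p_teacher = softmax(Xs @ theta)
        p_student = softmax(Xs @ w)
        grad_w = Xs.T @ (p_student - p_teacher) / M               # cross-entropy gradient w.r.t. w
        w = w - rho * (grad_w + gamma * w)
    return theta, w

# Two well-separated Gaussian blobs as stand-in training data.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-2, 1, size=(10, 2)), data_rng.normal(2, 1, size=(10, 2))])
y = np.array([0] * 10 + [1] * 10)
theta, w = distilled_sgld(X, y, K=2)
student_probs = softmax(X @ w)  # a single forward pass at test time
```

Note that only the current `theta` is kept in memory: the student absorbs each teacher sample as soon as it is drawn, so no set of samples has to be stored.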


2.2 Regression 


In regression, the observations are modeled as
$p(y_i|x_i, \theta) = \mathcal{N}(y_i|f(x_i|\theta), \lambda_n^{-1})$, where $f(x|\theta)$ is
the prediction of the TNN and $\lambda_n$ is the noise precision. We want to approximate the
predictive distribution as
$p(y|x, \mathcal{D}_N) \approx \mathcal{S}(y|x, w) = \mathcal{N}(y|\mu(x, w), e^{\alpha(x, w)})$.
We will train a student network to output the parameters of the approximating distribution,
$\mu(x, w)$ and $\alpha(x, w)$; note that this is twice the number of outputs of the teacher
network, since we want to capture the (data dependent) variance.³ We use $e^{\alpha(x, w)}$
instead of directly predicting the variance $\sigma^2(x|w)$ to avoid dealing with positivity
constraints during training.


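To train the SNN, we will minimize the objective defined in Eqn. 2:

$$
\begin{aligned}
\hat{L}(w|\theta^s, x)
  &= -\mathbb{E}_{p(y|x, \theta^s)} \log \mathcal{N}\big(y|\mu(x, w), e^{\alpha(x, w)}\big) \\
  &= \frac{1}{2}\, \mathbb{E}_{p(y|x, \theta^s)} \left[ \alpha(x, w) + e^{-\alpha(x, w)} \big(y - \mu(x, w)\big)^2 \right] + \mathrm{const} \\
  &= \frac{1}{2} \left[ \alpha(x, w) + e^{-\alpha(x, w)} \left\{ \big(f(x|\theta^s) - \mu(x, w)\big)^2 + \frac{1}{\lambda_n} \right\} \right] + \mathrm{const}
\end{aligned}
$$

where the constant $\frac{1}{2}\log 2\pi$ does not depend on $w$.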


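Now, to estimate $\nabla_w \hat{L}(w, \theta^s|x)$, we just have to compute
$\frac{\partial \hat{L}}{\partial \mu(x, w)}$ and $\frac{\partial \hat{L}}{\partial \alpha(x, w)}$,
and back-propagate through the network. These gradients are:

$$
\frac{\partial \hat{L}(w, \theta^s|x)}{\partial \mu(x, w)}
  = e^{-\alpha(x, w)} \left[ \mu(x, w) - f(x|\theta^s) \right]
\tag{5}
$$

$$
\frac{\partial \hat{L}(w, \theta^s|x)}{\partial \alpha(x, w)}
  = \frac{1}{2} \left[ 1 - e^{-\alpha(x, w)} \left\{ \big(f(x|\theta^s) - \mu(x, w)\big)^2 + \frac{1}{\lambda_n} \right\} \right]
\tag{6}
$$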
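
The two gradients can be checked numerically against central finite differences of the regression loss. The sketch below treats μ, α and the teacher prediction f as plain scalars for a single input; the specific numbers are arbitrary test values, and `loss` and `grads` are hypothetical helper names.

```python
import numpy as np

def loss(mu, alpha, f, lam_n):
    """Regression distillation loss for one input, up to an additive constant."""
    return 0.5 * (alpha + np.exp(-alpha) * ((f - mu) ** 2 + 1.0 / lam_n))

def grads(mu, alpha, f, lam_n):
    """Analytic gradients with respect to mu (Eq. 5) and alpha (Eq. 6)."""
    d_mu = np.exp(-alpha) * (mu - f)
    d_alpha = 0.5 * (1.0 - np.exp(-alpha) * ((f - mu) ** 2 + 1.0 / lam_n))
    return d_mu, d_alpha

mu, alpha, f, lam_n, eps = 0.3, -0.5, 1.2, 1.25, 1e-6
d_mu, d_alpha = grads(mu, alpha, f, lam_n)

# Central finite differences of the loss in each argument.
num_mu = (loss(mu + eps, alpha, f, lam_n) - loss(mu - eps, alpha, f, lam_n)) / (2 * eps)
num_alpha = (loss(mu, alpha + eps, f, lam_n) - loss(mu, alpha - eps, f, lam_n)) / (2 * eps)

# Setting Eq. (6) to zero gives the variance that is optimal for this single teacher sample.
alpha_star = np.log((f - mu) ** 2 + 1.0 / lam_n)
d_alpha_at_star = grads(mu, alpha_star, f, lam_n)[1]
```

The stationary point `alpha_star` shows what the student's variance output is pulled towards: the squared gap between the student mean and the teacher prediction, plus the observation noise variance 1/λ_n.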


3 Experimental results 


In this section, we compare SGLD and distilled SGLD with other approximate inference methods,
including the plug-in approximation using SGD, the PBP approach of [HLA15], the BBB approach of
[BCKW15], and Hamiltonian Monte Carlo (HMC) [Nea11], which is considered the “gold standard”
for MCMC for neural nets. We implemented SGD and SGLD using the Torch library (torch.ch).
For HMC, we used Stan (mc-stan.org). We perform this comparison for various classification
and regression problems, as summarized in Table 1.⁴

    Dataset          N     D     Y            PBP   BBB   HMC
    ToyClass         20    2     {0,1}        no    no    yes
    MNIST            60k   784   {0,...,9}    no    yes   no
    ToyReg           10    1     R            yes   no    yes
    Boston Housing   506   13    R            yes   no    no

Table 1: Summary of our experimental configurations. N is the number of examples, D the number of
features, and Y the output space; the last three columns indicate whether the method is included
in the comparison on that dataset.

[Figure 1, panels (a)-(f)]

Figure 1: Posterior predictive density for various methods on the toy 2d dataset. (a) SGD (plug-in)
using the 2-10-2 network. (b) HMC using 20k samples. (c) SGLD using 1k samples. (d-f) Distilled
SGLD using a student network with the following architectures: 2-10-2, 2-100-2 and 2-10-10-2.

3 This is not necessary in the classification case, since the softmax distribution already captures uncertainty.

3.1 Toy 2d classification problem 

We start with a toy 2d binary classification problem, in order to visually illustrate the performance
of different methods. We generate a synthetic dataset in 2 dimensions with 2 classes, 10 points per
class. We then fit a multi-layer perceptron (MLP) with one hidden layer of 10 ReLU units and 2
softmax outputs (denoted 2-10-2) using SGD. The resulting predictions are shown in Figure 1(a).
We see the expected sigmoidal probability ramp orthogonal to the linear decision boundary.
Unfortunately, this method predicts a label of 0 or 1 with very high confidence, even for points
that are far from the training data (e.g., in the top left and bottom right corners).

In Figure 1(b), we show the result of HMC using 20k samples. This is the “true” posterior predictive
density which we wish to approximate. In Figure 1(c), we show the result of SGLD using about 1000
samples. Specifically, we generate 100k samples, discard the first 2k for burn-in, and then keep
every 100'th sample. We see that this is a good approximation to the HMC distribution.
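
As a quick check of the sample count quoted above, the burn-in and thinning arithmetic can be reproduced directly. The zero-based indexing convention (keeping the first post-burn-in draw and every 100th one after it) is an assumption; the text does not specify it.

```python
total, burn_in, thin = 100_000, 2_000, 100

# Indices of the SGLD draws that survive burn-in removal and thinning.
kept = list(range(burn_in, total, thin))
num_kept = len(kept)
```

Under this convention `num_kept` is 980, consistent with "about 1000 samples".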

In Figures 1(d-f), we show the results of approximating the SGLD Monte Carlo predictive
distribution with a single student MLP of various sizes. To train this student network, we sampled
points at random from the domain of the input, [-10, 10] x [-10, 10]; this encourages the student
to predict accurately at all locations, including those far from the training data. In (d), the
student has the same size as the teacher (2-10-2), but this is too simple a model to capture the
complexity of the predictive distribution (which is an average over models). In (e), the student
has a larger hidden layer (2-100-2); this works better. However, we get best results using a two
hidden layer model (2-10-10-2), as shown in (f).

4 Ideally, we would apply all methods to all datasets, to enable a proper comparison. Unfortunately, this was
not possible, for various reasons. First, the open source code for the EP approach only supports regression, so
we could not evaluate this on classification problems. Second, we were not able to run the BBB code, so we just
quote performance numbers from their paper [BCKW15]. Third, HMC is too slow to run on large problems, so
we just applied it to the small “toy” problems. Nevertheless, our experiments show that our methods compare
favorably to these other methods.

    Model                 Num. params.   KL
    SGD                   40             0.246
    SGLD                  40k            0.007
    Distilled 2-10-2      40             0.031
    Distilled 2-100-2     400            0.014
    Distilled 2-10-10-2   140            0.009

Table 2: KL divergence on the 2d classification dataset.

    SGD [BCKW15]   Dropout   BBB    SGD (our impl.)   SGLD             Dist. SGLD
    1.83           1.51      1.82   1.536 ± 0.0120    1.271 ± 0.0126   1.307 ± 0.0169

Table 3: Test set misclassification rate (%) on MNIST for different methods using a 784-400-400-10
MLP. SGD (first column), Dropout and BBB numbers are quoted from [BCKW15]. For our
implementation of SGD (fourth column), SGLD and distilled SGLD, we report the mean
misclassification rate over 10 runs and its standard error.

In Table 2 we show the KL divergence between the HMC distribution (which we consider as ground
truth) and the various approximations mentioned above. We computed this by comparing the
probability distributions pointwise on a 2d grid. The numbers match the qualitative results shown
in Figure 1.
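
One way such a grid comparison could be carried out is sketched below. Several details are assumptions, since the text does not state them: the grid resolution, the direction KL(reference || approximation), and averaging the pointwise KL over grid points. The two predictive distributions are synthetic stand-ins, not the HMC or SGLD outputs.

```python
import numpy as np

def grid_kl(p, q, eps=1e-12):
    """Mean over grid points of KL(p(.|x) || q(.|x)); p and q have shape (G, K)."""
    p = np.clip(p, eps, 1.0)  # guard against log(0)
    q = np.clip(q, eps, 1.0)
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1))

def two_class(score):
    """Turn a real-valued score into a two-class probability vector per grid point."""
    p1 = 1.0 / (1.0 + np.exp(-score))
    return np.stack([1.0 - p1, p1], axis=1)

# Regular grid over the input domain [-10, 10] x [-10, 10].
xs = np.linspace(-10, 10, 41)
grid = np.array([(a, b) for a in xs for b in xs])

# Synthetic stand-ins: a smooth "reference" predictive and an overconfident approximation.
p_ref = two_class(0.3 * (grid[:, 0] + grid[:, 1]))
p_sharp = two_class(3.0 * (grid[:, 0] + grid[:, 1]))

kl_same = grid_kl(p_ref, p_ref)
kl_sharp = grid_kl(p_ref, p_sharp)
```

An approximation that is overconfident away from the decision boundary, like `p_sharp`, is penalized at every grid point where the reference is uncertain.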


3.2 MNIST classification 

Now we consider the MNIST digit classification problem, which has $N = 60$k examples, 10
classes, and $D = 784$ features. The only preprocessing we do is divide the pixel values by 126
(as in [BCKW15]). We train only on 50K datapoints and use the remaining 10K for tuning
hyper-parameters. This means our results are not strictly comparable to a lot of published work,
which uses the whole dataset for training; however, the difference is likely to be small.

Following [BCKW15], we use an MLP with 2 hidden layers with 400 hidden units per layer, ReLU
activations, and softmax outputs; we denote this by 784-400-400-10. This model has 500k
parameters.

We first fit this model by SGD, using these hyper-parameters: fixed learning rate of
$\eta_t = 5 \times 10^{-6}$, prior precision $\lambda = 1$, minibatch size $M = 100$, number of
iterations $T = 1$M. As shown in Table 3, our final error rate on the test set is 1.536%, which is
a bit lower than the SGD number reported in [BCKW15], perhaps due to the slightly different
training/validation configuration.

Next we fit this model by SGLD, using these hyper-parameters: fixed learning rate of
$\eta_t = 4 \times 10^{-6}$, thinning interval $\tau = 100$, burn-in iterations $B = 1000$, prior
precision $\lambda = 1$, minibatch size $M = 100$. As shown in Table 3, our final error rate on the
test set is about 1.271%, which is better than the SGD, dropout and BBB results from [BCKW15].⁵

Finally, we consider using distillation, where the teacher is an SGLD MC approximation of the
posterior predictive. We use the same 784-400-400-10 architecture for the student as well as the
teacher. We generate data for the student by adding Gaussian noise (with standard deviation of
0.001) to randomly sampled training points.⁶ We use a constant learning rate of $\rho = 0.005$, a
batch size of $M = 100$, a prior precision of 0.001 (for the student) and train for $T = 1$M
iterations. We obtain a test error of 1.307%, which is very close to that obtained with SGLD (see
Table 3).


5 We only show the BBB results with the same Gaussian prior that we use. Performance of BBB can be
improved using other priors, such as a scale mixture of Gaussians, as shown in [BCKW15]. Our approach
could probably also benefit from such a prior, but we did not try this.

6 In the future, we would like to consider more sophisticated data perturbations, such as elastic distortions.

    SGD                 SGLD                Distilled SGLD
    -0.0613 ± 0.0002    -0.0419 ± 0.0002    -0.0502 ± 0.0007

Table 4: Log likelihood per test example on MNIST. We report the mean over 10 trials ± one
standard error.


    Method                         Avg. test log likelihood
    PBP (as reported in [HLA15])   -2.574 ± 0.089
    VI (as reported in [HLA15])    -2.903 ± 0.071
    SGD                            -2.7639 ± 0.1527
    SGLD                           -2.306 ± 0.1205
    SGLD distilled                 -2.350 ± 0.0762

Table 5: Log likelihood per test example on the Boston housing dataset. We report the mean over
20 trials ± one standard error.


We also report the average test log-likelihood of SGD, SGLD and distilled SGLD in Table 4. The
log-likelihood is equivalent to the logarithmic scoring rule [Bic07] used in assessing the
calibration of probabilistic models. The logarithmic rule is a strictly proper scoring rule,
meaning that the score is uniquely maximized by predicting the true probabilities. From Table 4 we
see that both SGLD and distilled SGLD achieve higher scores than SGD, and therefore produce better
calibrated predictions.
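
The score itself is simple to compute from predicted class probabilities. The sketch below uses made-up predictions, and `avg_log_likelihood` and `mean_and_stderr` are hypothetical helper names; it is meant only to show how two models with identical accuracy can receive very different log-likelihood scores.

```python
import numpy as np

def avg_log_likelihood(probs, labels):
    """Mean of log p(y_i | x_i) over test examples; probs is (N, K), labels is (N,)."""
    return np.mean(np.log(probs[np.arange(len(labels)), labels]))

def mean_and_stderr(values):
    """Mean and standard error of a set of per-trial scores."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))

labels = np.array([0, 1, 1])
# Both models predict the same classes and both get the third example wrong.
moderate = np.array([[0.80, 0.20], [0.30, 0.70], [0.60, 0.40]])
overconfident = np.array([[0.99, 0.01], [0.01, 0.99], [0.99, 0.01]])

ll_moderate = avg_log_likelihood(moderate, labels)
ll_overconfident = avg_log_likelihood(overconfident, labels)
trial_mean, trial_stderr = mean_and_stderr([1.0, 2.0, 3.0])
```

The overconfident model is punished heavily for its single confident mistake, which is why the log-likelihood separates methods that misclassification rate alone cannot.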

Note that the SGLD results were obtained by averaging predictions from about 10,000 models sampled
from the posterior, whereas distillation produces a single neural network that approximates the
average prediction of these models, i.e., distillation reduces both storage and test time costs of
SGLD by a factor of 10,000, without sacrificing much accuracy. In terms of training time, SGD took
1.3 ms, SGLD took 1.6 ms and distilled SGLD took 3.2 ms per iteration. In terms of memory,
distilled SGLD requires only twice as much as SGD or SGLD during training, and the same as SGD
during testing.

3.3 Toy 1d regression

We now turn to a toy 1d regression problem, in order to visually illustrate the performance of
different methods. We use the same data and model as [HLA15]. In particular, we use $N = 20$
points in $D = 1$ dimensions, sampled from the function $y = x^3 + \epsilon_n$, where
$\epsilon_n \sim \mathcal{N}(0, 9)$. We fit this data with an MLP with 10 hidden units and ReLU
activations. For SGLD, we use $S = 2000$ samples. For distillation, the teacher uses the same
architecture as the student.

The results are shown in Figure 2. We see that SGLD is a better approximation to the “true” (HMC)
posterior predictive density than the plug-in SGD approximation (which has no predictive
uncertainty), and the VI approximation of [Gra11]. Finally, we see that distilling SGLD incurs
little loss in accuracy, but saves a lot computationally.

3.4 Boston housing 

Finally, we consider a larger regression problem, namely the Boston housing dataset, which was
also used in [HLA15]. This has $N = 506$ data points (456 training, 50 testing), with $D = 13$
dimensions. Since this data set is so small, we repeated all experiments 20 times, using different
train/test splits.

Following [HLA15], we use an MLP with 1 layer of 50 hidden units and ReLU activations. First
we use SGD, with these hyper-parameters:⁷ minibatch size $M = 1$, noise precision
$\lambda_n = 1.25$, prior precision $\lambda = 1$, number of trials 20, constant learning rate
$\eta_t = 10^{-6}$, number of iterations $T = 170$K. As shown in Table 5, we get an average log
likelihood of -2.7639.

Next we fit the model using SGLD. We use an initial learning rate of $\eta_0 = 10^{-5}$, which we
reduce by a factor of 0.5 every 80K iterations; we use 500K iterations, a burn-in of 10K, and a
thinning interval of 10. As shown in Table 5, we get an average log likelihood of -2.306, which is
better than SGD.

7 We choose all hyper-parameters using cross-validation whereas [HLA15] performs posterior inference on
the noise and prior precisions, and uses Bayesian optimization to choose the remaining hyper-parameters.

Figure 2: Predictive distribution for different methods on a toy 1d regression problem. (a) PBP of
[HLA15]. (b) HMC. (c) VI method of [Gra11]. (d) SGD. (e) SGLD. (f) Distilled SGLD. Error bars
denote 3 standard deviations. (Figures a-d kindly provided by the authors of [HLA15]. We replace
their term “BP” (backprop) with “SGD” to avoid confusion.)

Finally, we distill our SGLD model. The student architecture is the same as the teacher. We use the
following teacher hyper-parameters: prior precision $\lambda = 2.5$; initial learning rate of
$\eta_0 = 10^{-5}$, which we reduce by a factor of 0.5 every 80K iterations. For the student, we
use training data generated with Gaussian noise of standard deviation 0.05, a prior precision of
$\gamma = 0.001$, and an initial learning rate of $\rho_0 = 10^{-2}$, which we reduce by a factor
of 0.8 every 5,000 iterations. As shown in Table 5, we get an average log likelihood of -2.350,
which is only slightly worse than SGLD, and much better than SGD. Furthermore, both SGLD and
distilled SGLD are better than the PBP method of [HLA15] and the VI method of [Gra11].
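
The learning-rate schedules quoted for the teacher and the student are both step decays, which can be written as a one-line function. Whether the reduction takes effect exactly at the boundary iteration is an assumption here, and `step_decay` is a hypothetical helper name.

```python
def step_decay(initial_rate, factor, every, t):
    """Learning rate at iteration t when the rate is multiplied by `factor` every `every` iterations."""
    return initial_rate * factor ** (t // every)

# Teacher: 1e-5, halved every 80K iterations; after 200K iterations it has been halved twice.
teacher_rate = step_decay(1e-5, 0.5, 80_000, 200_000)

# Student: 1e-2, multiplied by 0.8 every 5,000 iterations; after 12,000 iterations, twice.
student_rate = step_decay(1e-2, 0.8, 5_000, 12_000)
```

Note how much faster the student schedule decays in terms of iterations, even though its starting rate is three orders of magnitude larger.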


4 Conclusions and future work 


We have shown a very simple method for “being Bayesian” about neural networks (and other kinds
of models), that seems to work better than recently proposed alternatives based on EP [HLA15] and
VB [Gra11, BCKW15].

There are various things we would like to do in the future: (1) Show the utility of our model in
an end-to-end task, where predictive uncertainty is useful (such as with contextual bandits or active
learning). (2) Consider ways to reduce the variance of the algorithm, perhaps by keeping a running
minibatch of parameters uniformly sampled from the posterior, which can be done online using
reservoir sampling. (3) Explore more intelligent data generation methods for training the student.
(4) Investigate whether our method is able to reduce the prevalence of confident false predictions
on adversarially generated examples, such as those discussed in [SZS+14].
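
For readers unfamiliar with the reservoir sampling mentioned in item (2), the standard scheme (often called Algorithm R) is sketched below. This is a generic illustration of the technique, not something evaluated in this paper; integers stand in for the parameter samples, and `reservoir_update` is a hypothetical helper name.

```python
import random

def reservoir_update(reservoir, item, n_seen, capacity, rng):
    """Algorithm R: after the call, `reservoir` is a uniform sample of the first n_seen items."""
    if len(reservoir) < capacity:
        reservoir.append(item)     # fill phase: keep everything
    else:
        j = rng.randrange(n_seen)  # uniform integer in [0, n_seen)
        if j < capacity:           # happens with probability capacity / n_seen
            reservoir[j] = item    # overwrite a uniformly chosen slot

rng = random.Random(0)
reservoir = []
# Stream of 1000 "samples"; only 10 are held in memory at any time.
for n_seen, sample in enumerate(range(1000), start=1):
    reservoir_update(reservoir, sample, n_seen, capacity=10, rng=rng)
```

The memory cost is fixed at `capacity` items regardless of how long the sampler runs, which is what would make the idea compatible with the online setting.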